Online Markov Decision Processes With Kullback–Leibler Control Cost
نویسندگان
چکیده
منابع مشابه
Online Learning in Markov Decision Processes with Changing Cost Sequences
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and banditinformation. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a ...
متن کاملOnline Markov Decision Processes
We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which ha...
متن کاملOnline Markov decision processes with policy iteration
The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...
متن کاملl AVERAGE COST SEMI - MARKOV DECISION PROCESSES
^ The Semi-Markov Decision model is considered under the criterion of long-run average cost. A new criterion, which for any policy considers the limit of the expected cost Incurred during the first n transitions divided by the expected length of the first n transitions, is considered. Conditions guaranteeing that an optimal stationary (nonrandomized) policy exist are then presented. It is also ...
متن کاملOnline Regret Bounds for Markov Decision Processes with Deterministic Transitions
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: IEEE Transactions on Automatic Control
سال: 2014
ISSN: 0018-9286,1558-2523
DOI: 10.1109/tac.2014.2301558